pacman::p_load(sf, tidyverse, tmap, spdep, funModeling, ggpubr, corrplot,
heatmaply, cluster, factoextra)Take-home Exercise 2: Regionalisation of Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods
1 Overview
The process of creating regions is called regionalisation. A regionalisation is a special kind of clustering where the objective is to group observations which are similar in their statistical attributes, but also in their spatial location. In this sense, regionalization embeds the same logic as standard clustering techniques, but also applies a series of geographical constraints. Often, these constraints relate to connectivity: two candidates can only be grouped together in the same region if there exists a path from one member to another member that never leaves the region. These paths often model the spatial relationships in the data, such as contiguity or proximity. However, connectivity does not always need to hold for all regions, and in certain contexts it makes sense to relax connectivity or to impose different types of geographic constraints.
1.1 Getting Started
1.1.1 Load Packages
For our analysis, we will utilize the following packages:
sf - for importing and processing geospatial data,
tidyverse - for importing and processing non-spatial data. In this exercise, readr package will be used for importing wkt data and dplyr package will be used to wrangling the data.
We will run the following code chunk to load the required packages:
1.1.2 Import Data
1.1.2.1 Importing water point data
wp_nga <- read_csv("data/aspatial/WPdx.csv") %>%
filter(`#clean_country_name` == "Nigeria")Thing to learn from the code chunk above:
The original file name is called Water_Point_Data_Exchange_-_PlusWPdx.csv, it has been rename to WPdx.csv for easy encoding.
Instead of using
read.csv()of Base R to import the csv file into R,read_csv()is readr package is used. This is because during the initial data exploration, we notice that there is at least one field name with space between the field name (ie. New Georeferenced Column)The data file contains water point data of many countries. In this study, we are interested on water point in Nigeria on. Hence,
filter()of dplyr is used to extract out records belong to Nigeria only.
Convert wkt data
After the data are imported into R environment, it is a good practice to review both the data structure and the data table if it is in tibble data frame format in R Studio.
Notice that the newly imported tibble data frame (i.e. wp_nga) contains a field called New Georeferenced Column which represent spatial data in a textual format. In fact, this kind of text file is popularly known as Well Known Text in short wkt.
Two steps will be used to convert an asptial data file in wkt format into a sf data frame by using sf.
First, st_as_sfc() of sf package is used to derive a new field called Geometry as shown in the code chunk below.
wp_nga$Geometry <- st_as_sfc(wp_nga$`New Georeferenced Column`)Next, st_sf() will be used to convert the tibble data frame into sf data frame.
wp_nga <- st_sf(wp_nga, crs=4326) %>% st_transform(crs = 26391)
wp_ngaWhen the process completed, a new sf data frame called wp_sf will be created.
1.1.2.2 Importing Nigeria LGA level boundary data
For the purpose of this exercise, shapefile downloaded from geoBoundaries portal will be used.
nga <- st_read(dsn = "data/geospatial",
layer = "nga_admbnda_adm2_osgof_20190417",
crs = 4326) %>%
st_transform(crs = 26391) %>%
select(3:4,8:9,17)Reading layer `nga_admbnda_adm2_osgof_20190417' from data source
`C:\hulwana\ISSS624\Take-Home_Ex\Take-Home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 774 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS: WGS 84
1.2 Data Preparation
Before proceeding to the geospatial analysis, we will first prepare the data.
1.2.1 Checking duplicated area names
We will first check if there are any duplicated areas by running the following code chunk:
dup <- nga$ADM2_EN[duplicated(nga$ADM2_EN)]
dup[1] "Bassa" "Ifelodun" "Irepodun" "Nasarawa" "Obi" "Surulere"
From the above, we see that areas Bassa, Ifelodun, Irepodun, Nasarawa, Obi and Surulere have duplicated labelling.
We will then plot the duplicated areas to determine to understand where are the areas with duplicated names
dup_areas <- nga %>%
filter(ADM2_EN %in% dup) %>%
select(ADM2_EN, geometry)
state_borders <- nga %>%
select(ADM1_EN, geometry)
tmap_mode("view")
tm_shape(state_borders) +
tm_fill("ADM1_EN") +
tm_shape(dup_areas) +
tm_polygons("ADM2_EN", alpha = 0.08) +
tm_layout(legend.show = FALSE)tmap_mode("plot")Upon searching on the web based on the information gathers in nigeriainfopedia, we realized that the duplication in names exist due to areas having similar names in different state. The states at which these areas are located are as follows:
dup_areas_state <- nga %>%
filter(ADM2_EN %in% dup) %>%
select(ADM2_EN, ADM1_EN)
dup_areas_stateSimple feature collection with 12 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 99926.41 ymin: 271934.4 xmax: 729231.7 ymax: 893313
Projected CRS: Minna / Nigeria West Belt
First 10 features:
ADM2_EN ADM1_EN geometry
1 Bassa Kogi MULTIPOLYGON (((555599.8 44...
2 Bassa Plateau MULTIPOLYGON (((704592.8 70...
3 Ifelodun Kwara MULTIPOLYGON (((273735.2 55...
4 Ifelodun Osun MULTIPOLYGON (((255291.7 43...
5 Irepodun Kwara MULTIPOLYGON (((305947.5 46...
6 Irepodun Osun MULTIPOLYGON (((235603 4279...
7 Nasarawa Kano MULTIPOLYGON (((677128.6 89...
8 Nasarawa Nasarawa MULTIPOLYGON (((608258.1 51...
9 Obi Benue MULTIPOLYGON (((663064.8 34...
10 Obi Nasarawa MULTIPOLYGON (((727739.2 48...
Since these areas have duplicated names, it might result in an inaccurate analysis, and therefore has to be recoded by executing the following code chunk:
nga$ADM2_EN[nga$ADM2_EN == "Bassa" & nga$ADM1_EN == "Kogi"] <- "Bassa (Kogi)"
nga$ADM2_EN[nga$ADM2_EN == "Bassa" & nga$ADM1_EN == "Plateau"] <- "Bassa (Plateau)"
nga$ADM2_EN[nga$ADM2_EN == "Ifelodun" & nga$ADM1_EN == "Kwara"] <- "Ifelodun (Kwara)"
nga$ADM2_EN[nga$ADM2_EN == "Ifelodun" & nga$ADM1_EN == "Osun"] <- "Ifelodun (Osun)"
nga$ADM2_EN[nga$ADM2_EN == "Irepodun" & nga$ADM1_EN == "Kwara"] <- "Irepodun (Kwara)"
nga$ADM2_EN[nga$ADM2_EN == "Irepodun" & nga$ADM1_EN == "Osun"] <- "Irepodun (Osun)"
nga$ADM2_EN[nga$ADM2_EN == "Nasarawa" & nga$ADM1_EN == "Kano"] <- "Nasarawa (Kano)"
nga$ADM2_EN[nga$ADM2_EN == "Nasarawa" & nga$ADM1_EN == "Nasarawa"] <- "Nasarawa (Nasarawa)"
nga$ADM2_EN[nga$ADM2_EN == "Obi" & nga$ADM1_EN == "Benue"] <- "Obi (Benue)"
nga$ADM2_EN[nga$ADM2_EN == "Obi" & nga$ADM1_EN == "Nasarawa"] <- "Obi (Nasarawa)"
nga$ADM2_EN[nga$ADM2_EN == "Surulere" & nga$ADM1_EN == "Lagos"] <- "Surulere (Lagos)"
nga$ADM2_EN[nga$ADM2_EN == "Surulere" & nga$ADM1_EN == "Oyo"] <- "Surulere (Oyo)"Check if there are duplicated in LGA names after the clean-up
nga$ADM2_EN[duplicated(nga$ADM2_EN)]character(0)
1.3 Data Wrangling
1.3.1 Extract all the required variables and recode if needed
Since we would like to understand if there are any relation ship on the number of functional and non-functional point, we would need to ensure that the variables required are cleaned. We will first load the data to see what are the fields present by using the glimpse() function.
glimpse(wp_nga)In total there are 71 fields each having 95,008 observations. Thus, we will select only the required columns needed for our analysis.
The data required for our analysis are:
Total number of functional water points
Total number of nonfunctional water points
Percentage of functional water points
Percentage of non-functional water points
Percentage of main water point technology (i.e. Hand Pump)
Percentage of usage capacity (i.e. < 1000, >=1000)
Percentage of rural water points
Thus we will:
Select the columns `#water_tech_category`, `#status_clean` and is_urban
Additional columns selected: `#subjective_quality` and usage_capacity
Tidy the name of variables that starts with “#”
wp_rev <- wp_nga %>%
select(10,22,26,46,47) %>%
rename(`water_tech` = `#water_tech_category`, `status_clean` = `#status_clean`,
`quality` = `#subjective_quality` )Since, we are interested to know how many functional and non-functional taps there are, we execute the following code to count the number of functional and non-functional taps as well as get the percentages of each type of taps.
freq(data=wp_rev,
input = 'status_clean')
status_clean frequency percentage cumulative_perc
1 Functional 45883 48.29 48.29
2 Non-Functional 29385 30.93 79.22
3 Unknown 10656 11.22 90.44
4 Functional but needs repair 4579 4.82 95.26
5 Non-Functional due to dry season 2403 2.53 97.79
6 Functional but not in use 1686 1.77 99.56
7 Abandoned/Decommissioned 234 0.25 99.81
8 Abandoned 175 0.18 99.99
9 Non functional due to dry season 7 0.01 100.00
1.3.1.1 Recoding NA values into String
We observed that there are more than 10% of observations that are NAs for this field. Thus, we will recode it into ‘Unknown’.
wp_rev <- wp_rev %>%
dplyr::mutate(status_clean =
replace_na(status_clean, "Unknown"))1.3.2 Extracting Water Point Data
We will re-group the water point categories into the following
Unknown
Functional
Non-functional
1.3.2.1 Extracting Water Point with Unknown Class
In the code chunk below, filter() of dplyr is used to select water points with unknown status.
wpt_unknown <- wp_rev %>%
filter(status_clean == "Unknown")1.3.2.2 Extracting Functional Water Point
In the code chunk below, filter() of dplyr is used to select functional water points.
We will consider the following categories as functional water points:
Functional
Functional but not in use
Functional but needs repair
wpt_functional <- wp_rev %>%
filter(status_clean %in%
c("Functional",
"Functional but not in use",
"Functional but needs repair"))1.3.2.3 Extracting Non-Functional Water Point
On the other hand, the following categories, will be grouped as non-functional water points:
Non-Functional
Non-Functional due to dry season
Abandoned/Decommissioned
Abandoned
Non functional due to dry season
wpt_nonfunctional <- wp_rev %>%
filter(status_clean %in%
c("Non-Functional",
"Non-Functional due to dry season",
"Abandoned/Decommissioned",
"Abandoned",
"Non functional due to dry season"))1.3.2.4 Performing Point-in-Polygon Count
To count the number of different categories of water points found by LGA, we will utilize the mutate() function for the calculation as shown in the code:
nga_wp <- nga %>%
mutate(`total wpt` = lengths(
st_intersects(nga, wp_rev))) %>%
mutate(`wpt functional` = lengths(
st_intersects(nga, wpt_functional))) %>%
mutate(`wpt non-functional` = lengths(
st_intersects(nga, wpt_nonfunctional))) %>%
mutate(`wpt unknown` = lengths(
st_intersects(nga, wpt_unknown)))1.3.2.5 Compute the Percentages of Water Points
To compute the percentages of functional and non-functional water points, we execute the following code
nga_wp <- nga_wp %>%
mutate(pct_functional = `wpt functional`/`total wpt`) %>%
mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`)1.3.3 Extracting Water Technology Data
To see what are the different types of water technology present as well as its distribution, we run the following code:
freq(data=wp_rev,
input = 'water_tech')
water_tech frequency percentage cumulative_perc
1 Hand Pump 58755 61.84 61.84
2 Mechanized Pump 25644 26.99 88.83
3 <NA> 10055 10.58 99.41
4 Tapstand 553 0.58 99.99
5 Rope and Bucket 1 0.00 100.00
Observed that the dominating type of water technology belongs to the ‘Hand Pump’ category at 61.84% of the water point found in Nigeria. As the number of ‘Mechanized Pump’ is substantially large we will also consider the percentage of this type of water point technology in our analysis. The number of water points that are either ‘Tapstand’ and ‘Rope and Bucket’ is too small and thus will not be considered as a variable in our analysis.
1.3.3.1 Extracting Hand Pump Water Points
wpt_hand <- wp_rev %>%
filter(water_tech == "Hand Pump")1.3.3.2 Extracting Mechanized Pump Water Points
wpt_mechanized <- wp_rev %>%
filter(water_tech == "Mechanized Pump")1.3.3.3 Performing Point-in-Polygon Count
To count the number of different categories of water point techinlogies found in each LGA, we will utilize the mutate() function for the calculation as shown in the code:
nga_wp <- nga_wp %>%
mutate(`wpt hand` = lengths(
st_intersects(nga_wp, wpt_hand))) %>%
mutate(`wpt mechanized` = lengths(
st_intersects(nga_wp, wpt_mechanized)))1.3.3.4 Compute the Percentages of Hand Pump and Mechanized Pump Water Points
To compute the percentages of functional and non-functional water points, we execute the following code:
nga_wp <- nga_wp %>%
mutate(pct_hand = `wpt hand`/`total wpt`) %>%
mutate(`pct_mechanized` = `wpt mechanized`/`total wpt`)1.3.4 Extracting Rural and Urban Areas
1.3.4.1 Extract data on rural areas
wpt_rural <- wp_rev %>%
filter(is_urban == FALSE)1.3.4.2 Performing Point-in-Polygon Count
nga_wp <- nga_wp %>%
mutate(`wpt rural` = lengths(
st_intersects(nga_wp, wpt_rural)))1.3.4.3 Compute the Percentages of Rural Areas
nga_wp <- nga_wp %>%
mutate(pct_rural = `wpt rural`/`total wpt`)1.3.5 Extracting Quality
1.3.5.1 Different Categories for Quality
freq(data=wp_rev,
input = 'quality')
quality frequency percentage cumulative_perc
1 Acceptable quality 71801 75.57 75.57
2 <NA> 10625 11.18 86.75
3 No because of Taste 6712 7.06 93.81
4 No because of Colour 3986 4.20 98.01
5 No because of Odour 1847 1.94 99.95
6 Within National limits (potable) 19 0.02 99.97
7 Within National standards (potable) 18 0.02 100.00
1.3.5.2 Acceptable Quality
wpt_acceptable <- wp_rev %>%
filter(quality %in%
c("Acceptable quality",
"Within National standards (potable)",
"Within National limits (potable)"))1.3.5.3 Unacceptable Quality
wpt_unacceptable <- wp_rev %>%
filter(quality %in%
c("No because of Taste",
"No because of Colour",
"No because of Odour"))1.3.5.4 Performing Point-in-Polygon Count
nga_wp <- nga_wp %>%
mutate(`wpt acceptable` = lengths(
st_intersects(nga_wp, wpt_acceptable))) %>%
mutate(`wpt unacceptable` = lengths(
st_intersects(nga_wp, wpt_unacceptable)))1.3.5.5 Compute the Percentages of Acceptable and Unacceptable Water Quality
nga_wp <- nga_wp %>%
mutate(pct_acceptable = `wpt acceptable`/`total wpt`) %>%
mutate(pct_unacceptable = `wpt unacceptable`/`total wpt`)1.3.6 Usage Capacity
1.3.6.1 Extracting Usage Capacity Data
freq(data=wp_rev,
input = 'usage_capacity')
usage_capacity frequency percentage cumulative_perc
1 300 68789 72.40 72.40
2 1000 25644 26.99 99.39
3 250 573 0.60 99.99
4 50 2 0.00 100.00
We see that there are 2 groups with stubstantial number of observations which are 300 and 1000. Thus, we will recode the groups to those with less than or equal to 300 and more than 300.
1.3.6.2 Usage capacity of 300 or lesser
wpt_less300 <- wp_rev %>%
filter(usage_capacity < 301)1.3.6.3 Usage capacity greater than 300
wpt_more300 <- wp_rev %>%
filter(usage_capacity > 300)1.3.6.4 Performing Point-in-Polygon Count
nga_wp <- nga_wp %>%
mutate(`wpt less300` = lengths(
st_intersects(nga_wp, wpt_less300))) %>%
mutate(`wpt more300` = lengths(
st_intersects(nga_wp, wpt_more300)))1.3.6.5 Compute the Percentages of Usage Capacity less than or greater than 301
nga_wp <- nga_wp %>%
mutate(pct_less300 = `wpt less300`/`total wpt`) %>%
mutate(pct_more300 = `wpt more300`/`total wpt`)1.3.7 Saving Data
We will then save the cleaned data in rds format.
2 Exploratory Data Analysis
2.1 Summary Statistics
Let us review the summary statistics of the newly derived penetration rates using the code chunk below.
summary(nga_wp) ADM2_EN ADM2_PCODE ADM1_EN ADM1_PCODE
Length:774 Length:774 Length:774 Length:774
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
geometry total wpt wpt functional wpt non-functional
MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00 Min. : 0.00
epsg:26391 : 0 1st Qu.: 45.0 1st Qu.: 17.00 1st Qu.: 12.25
+proj=tmer...: 0 Median : 96.0 Median : 45.50 Median : 34.00
Mean :122.7 Mean : 67.36 Mean : 41.60
3rd Qu.:168.8 3rd Qu.: 87.75 3rd Qu.: 60.75
Max. :894.0 Max. :752.00 Max. :278.00
wpt unknown pct_functional pct_non-functional wpt hand
Min. : 0.00 Min. :0.0000 Min. :0.0000 Min. : 0.00
1st Qu.: 0.00 1st Qu.:0.3333 1st Qu.:0.2211 1st Qu.: 6.00
Median : 0.00 Median :0.4792 Median :0.3559 Median : 47.00
Mean : 13.76 Mean :0.5070 Mean :0.3654 Mean : 75.89
3rd Qu.: 17.75 3rd Qu.:0.6749 3rd Qu.:0.5082 3rd Qu.:111.00
Max. :219.00 Max. :1.0000 Max. :1.0000 Max. :764.00
NA's :13 NA's :13
wpt mechanized pct_hand pct_mechanized wpt rural
Min. : 0.00 Min. :0.0000 Min. :0.0000 Min. : 0.00
1st Qu.: 11.00 1st Qu.:0.1860 1st Qu.:0.1250 1st Qu.: 23.00
Median : 25.50 Median :0.5255 Median :0.3193 Median : 64.00
Mean : 33.12 Mean :0.4956 Mean :0.3818 Mean : 97.45
3rd Qu.: 46.00 3rd Qu.:0.7857 3rd Qu.:0.5843 3rd Qu.:141.00
Max. :245.00 Max. :1.0000 Max. :1.0000 Max. :894.00
NA's :13 NA's :13
pct_rural wpt acceptable wpt unacceptable pct_acceptable
Min. :0.0000 Min. : 0.00 Min. : 0.0 Min. :0.0000
1st Qu.:0.5922 1st Qu.: 28.00 1st Qu.: 3.0 1st Qu.:0.5595
Median :0.8717 Median : 71.00 Median : 9.0 Median :0.7759
Mean :0.7395 Mean : 92.79 Mean : 16.2 Mean :0.7172
3rd Qu.:1.0000 3rd Qu.:127.00 3rd Qu.: 21.0 3rd Qu.:0.9221
Max. :1.0000 Max. :833.00 Max. :225.0 Max. :1.0000
NA's :13 NA's :13
pct_unacceptable wpt less300 wpt more300 pct_less300
Min. :0.00000 Min. : 0.00 Min. : 0.00 Min. :0.0000
1st Qu.:0.04124 1st Qu.: 16.00 1st Qu.: 11.00 1st Qu.:0.4157
Median :0.10526 Median : 60.00 Median : 25.50 Median :0.6807
Mean :0.15565 Mean : 89.59 Mean : 33.12 Mean :0.6182
3rd Qu.:0.21429 3rd Qu.:127.75 3rd Qu.: 46.00 3rd Qu.:0.8750
Max. :1.00000 Max. :767.00 Max. :245.00 Max. :1.0000
NA's :13 NA's :13
pct_more300
Min. :0.0000
1st Qu.:0.1250
Median :0.3193
Mean :0.3818
3rd Qu.:0.5843
Max. :1.0000
NA's :13
2.2 EDA using Histogram
Here, we take a look at the distribution of the percentages variable
functional <- ggplot(data=nga_wp,
aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
nonfunctional <- ggplot(data=nga_wp,
aes(x= `pct_non-functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hand <- ggplot(data=nga_wp,
aes(x= `pct_hand`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
mechanized <- ggplot(data=nga_wp,
aes(x= `pct_mechanized`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
rural <- ggplot(data=nga_wp,
aes(x= `pct_rural`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
acceptable <- ggplot(data=nga_wp,
aes(x= `pct_acceptable`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
unacceptable <- ggplot(data=nga_wp,
aes(x= `pct_unacceptable`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
less300 <- ggplot(data=nga_wp,
aes(x= `pct_less300`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
more300 <- ggplot(data=nga_wp,
aes(x= `pct_more300`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(functional, nonfunctional, hand, mechanized, acceptable, unacceptable,
less300, more300, rural,
ncol = 3,
nrow = 3)
Notice that the variables pct_acceptable, pct_cap300 and pct_rural are skewed to the left. Whereas, variables pct_mechanized, pct_unacceptable and pct_cap1000 are skewed to the right.
2.3 EDA using Choropleth Map
2.3.1 Total Waterpoints
qtm(nga_wp, "total wpt")
From the above map, we notice that there are a number of areas in the north-eastern part of Nigeria in which there 0 data on the number of water points.
2.3.2 Distribution of Functional and Non-functional Water Points by Percentages
tmap_mode("view")
pct_functional <- tm_shape(nga_wp) +
tm_fill("pct_functional") +
tm_borders()
pct_nonfunctional <- tm_shape(nga_wp) +
tm_fill("pct_non-functional") +
tm_borders()
tmap_arrange(pct_functional, pct_nonfunctional,
asp = 1, ncol = 2,
sync = TRUE)tmap_mode("plot")In terms of functional water points, areas in the northern region have high percentages in comparison to other parts in Nigeria, while in terms of non-functional water points, areas in southern part of Nigeria have higher percentages.
2.3.3 Distribution of Hand Pump and Mechanized Pump
tmap_mode("view")
pct_hand <- tm_shape(nga_wp) +
tm_fill("pct_hand", palette = "BuGn") +
tm_borders()
pct_mechanized <- tm_shape(nga_wp) +
tm_fill("pct_mechanized", palette = "BuGn") +
tm_borders()
tmap_arrange(pct_hand, pct_mechanized,
asp = 1, ncol = 2,
sync = TRUE)tmap_mode("plot")We see a similar trend to that of the distribution of functional and non-functional water points, where the norther regions of Nigeria tend to have higher percentages of hand pump but lower percentage of mechanized pump whereas southern regions of Nigeria tended to have higher percentages of mechanized pumps but lower percentages of hand pumps. Perhaps this could be attributed to the technology and development of the different regions and therefore we shouldl proceed to look if there is a similar trend in terms of rural and urban areas.
2.3.4 Distribution of Rural Areas by Percentages
tmap_mode("view")
tm_shape(nga_wp) +
tm_fill("pct_rural", palette = "YlGn") +
tm_borders(col = "black", lwd = 2)tmap_mode("plot")Surprisingly, in terms of percentage of rural areas, it is almost homogeneous throughout Nigeria.
2.3.5 Distribution of Water Quality by Percentages
tmap_mode("view")
pct_accept <- tm_shape(nga_wp) +
tm_fill("pct_acceptable", palette = "Blues") +
tm_borders()
pct_unaccept <- tm_shape(nga_wp) +
tm_fill("pct_unacceptable", palette = "Blues") +
tm_borders()
tmap_arrange(pct_accept, pct_unaccept,
asp = 1, ncol = 2,
sync = TRUE)tmap_mode("plot")Based on the percentages of acceptable water quality, we see that the the distribution is also quite homogeneous among the areas in Nigeria with majority having high percentage of acceptable water quality.
2.3.6 Distribution of Usage Capacity
tmap_mode("view")
pct_less300 <- tm_shape(nga_wp) +
tm_fill("pct_less300", palette = "BuPu") +
tm_borders()
pct_more300 <- tm_shape(nga_wp) +
tm_fill("pct_more300", palette = "BuPu") +
tm_borders()
tmap_arrange(pct_less300, pct_more300,
asp = 1, ncol = 2,
sync = TRUE)tmap_mode("plot")We observe that the high percentages of usage capacity that is equal or greater than 1000 is prevalent in the southern part of Nigeria and some areas towards the extreme north-western region of Nigeria.
2.3.7 Further Data Preparation
From the above EDA, we notice that there are a number of areas in the north eastern part which seems to have missing water point data. To tackle this issue, we have 2 possible approaches by either
excluding those with unknown or missing data points from the analysis or
to impute an estimated value.
In our case, as the observations with NAs stems from no water point data for most of the fields (functional water points, usage capacity, etc.) we will omit these observations so as to obtain a more accurate analysis.
nga_wp <- nga_wp %>%
filter(`total wpt` > 0)2.4 Multicollinearity
2.4.1 Correlation Analysis
Before we perform cluster analysis, it is important for us to ensure that the cluster variables are not highly correlated.
In this section, you will learn how to use corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.
nga_wp2 <- nga_wp %>%
select(7,8,10,11,14,15,17,20,21,24,25) %>%
st_drop_geometry()
cluster_vars.cor = cor(nga_wp2)
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
Based on the correlation value computed, the following variables have high correlation of 0.8 and above:
pct_hand and pct_mechanized
pct_hand and pct_less300
pct_hand and pct_more300
pct_mechanized and pct_less300
pct_mechanized and pct_more300
pct_less300 and pct_more300
Since these variables are highly correlated to each other, we should remove one from each pair of correlated variables. In our case, we will exclude the variables pct_mechanized, pct_less300 and pct_more 300.
3 Clustering Analysis
In this section, we will perform hierarchical clustering.
3.1 Additional Data Preparation
Before we even proceed with the clustering, we need to further prepare the data.
3.1.1 Extracting clustering variables
The code chunk below will be used to extract the clustering variables from the nga_wp2 simple feature object into data.frame.
As hierarchical clustering does not consider geospatial data, thus it will be excluded by the code st_set_geometry().
nga_clust <- nga_wp %>%
select(1,7,8,10,11,14,17,20,21) %>%
st_drop_geometry()Notice that the final clustering variables list does not include variable pct_cap300 and pct_cap1000 because it is highly correlated with the variables pct_hand and pct _mechanized respectively.
Next, we need to change the rows by LGA instead of row number by using the code chunk below:
row.names(nga_clust) <- nga_clust$"ADM2_EN"
head(nga_clust,10) ADM2_EN wpt functional wpt non-functional pct_functional
Aba North Aba North 7 9 0.4117647
Aba South Aba South 29 35 0.4084507
Abaji Abaji 23 34 0.4035088
Abak Abak 23 25 0.4791667
Abakaliki Abakaliki 82 42 0.3519313
Abeokuta North Abeokuta North 16 15 0.4705882
Abeokuta South Abeokuta South 72 33 0.6050420
Abi Abi 79 62 0.5197368
Aboh-Mbaise Aboh-Mbaise 18 26 0.2727273
Abua/Odual Abua/Odual 25 13 0.6410256
pct_non-functional pct_hand pct_rural pct_acceptable
Aba North 0.5294118 0.11764706 0.00000000 0.7647059
Aba South 0.4929577 0.09859155 0.05633803 0.8028169
Abaji 0.5964912 0.40350877 0.84210526 0.9824561
Abak 0.5208333 0.08333333 0.83333333 0.7291667
Abakaliki 0.1802575 0.43776824 0.87553648 0.4206009
Abeokuta North 0.4411765 0.14705882 0.20588235 0.7352941
Abeokuta South 0.2773109 0.16806723 0.00000000 0.8655462
Abi 0.4078947 0.59868421 0.95394737 0.5657895
Aboh-Mbaise 0.3939394 0.01515152 0.72727273 0.5303030
Abua/Odual 0.3333333 0.30769231 0.53846154 0.8717949
pct_unacceptable
Aba North 0.17647059
Aba South 0.09859155
Abaji 0.01754386
Abak 0.27083333
Abakaliki 0.11158798
Abeokuta North 0.17647059
Abeokuta South 0.01680672
Abi 0.36184211
Aboh-Mbaise 0.13636364
Abua/Odual 0.10256410
Notice that the row number has been replaced into the LGA name. This is because hierarchical clustering does not need the LGA name. However, since we need to reference back to the LGA when we deduce the insights, thus we retain it as the rownames/ object id.
Now, we will delete the ADM2_EN field by using the code chunk below.
nga_clust <- select(nga_clust, c(2:9))
head(nga_clust, 10) wpt functional wpt non-functional pct_functional
Aba North 7 9 0.4117647
Aba South 29 35 0.4084507
Abaji 23 34 0.4035088
Abak 23 25 0.4791667
Abakaliki 82 42 0.3519313
Abeokuta North 16 15 0.4705882
Abeokuta South 72 33 0.6050420
Abi 79 62 0.5197368
Aboh-Mbaise 18 26 0.2727273
Abua/Odual 25 13 0.6410256
pct_non-functional pct_hand pct_rural pct_acceptable
Aba North 0.5294118 0.11764706 0.00000000 0.7647059
Aba South 0.4929577 0.09859155 0.05633803 0.8028169
Abaji 0.5964912 0.40350877 0.84210526 0.9824561
Abak 0.5208333 0.08333333 0.83333333 0.7291667
Abakaliki 0.1802575 0.43776824 0.87553648 0.4206009
Abeokuta North 0.4411765 0.14705882 0.20588235 0.7352941
Abeokuta South 0.2773109 0.16806723 0.00000000 0.8655462
Abi 0.4078947 0.59868421 0.95394737 0.5657895
Aboh-Mbaise 0.3939394 0.01515152 0.72727273 0.5303030
Abua/Odual 0.3333333 0.30769231 0.53846154 0.8717949
pct_unacceptable
Aba North 0.17647059
Aba South 0.09859155
Abaji 0.01754386
Abak 0.27083333
Abakaliki 0.11158798
Abeokuta North 0.17647059
Abeokuta South 0.01680672
Abi 0.36184211
Aboh-Mbaise 0.13636364
Abua/Odual 0.10256410
3.2 Data Standardization
In general, multiple variables will be used in cluster analysis. It is not unusual their values range are different. In order to avoid the cluster analysis result is baised to clustering variables with large values, it is useful to standardise the input variables before performing cluster analysis.
3.2.1 Min-Max Standardization
In the code chunk below, normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method. The summary() is then used to display the summary statistics of the standardized clustering variables.
nga_clust.std <- normalize(nga_clust)
summary(nga_clust.std) wpt functional wpt non-functional pct_functional pct_non-functional
Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.02394 1st Qu.:0.05036 1st Qu.:0.3333 1st Qu.:0.2211
Median :0.06250 Median :0.12230 Median :0.4792 Median :0.3559
Mean :0.09110 Mean :0.15218 Mean :0.5070 Mean :0.3654
3rd Qu.:0.11702 3rd Qu.:0.21942 3rd Qu.:0.6749 3rd Qu.:0.5082
Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
pct_hand pct_rural pct_acceptable pct_unacceptable
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.1860 1st Qu.:0.5922 1st Qu.:0.5595 1st Qu.:0.04124
Median :0.5255 Median :0.8717 Median :0.7759 Median :0.10526
Mean :0.4956 Mean :0.7395 Mean :0.7172 Mean :0.15565
3rd Qu.:0.7857 3rd Qu.:1.0000 3rd Qu.:0.9221 3rd Qu.:0.21429
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
Notice that the values range of the Min-max standardized clustering variables are 0-1 now.
3.2.2 Z-SCORE STANDARDIZATION
Z-score standardization can be performed easily by using scale() of Base R. The code chunk below will be used to stadardization the clustering variables by using Z-score method.
nga_clust.z <- scale(nga_clust)
describe(nga_clust.z)nga_clust.z
8 Variables 761 Observations
--------------------------------------------------------------------------------
wpt functional
n missing distinct Info Mean Gmd .05 .10
761 0 198 1 3.739e-17 0.892 -0.7944 -0.7574
.25 .50 .75 .90 .95
-0.6220 -0.2649 0.2400 0.9296 1.8408
lowest : -0.8436052 -0.8312915 -0.8189779 -0.8066643 -0.7943506
highest: 5.3132110 5.5594836 6.4337515 7.0986877 8.4162463
--------------------------------------------------------------------------------
wpt non-functional
n missing distinct Info Mean Gmd .05 .10
761 0 140 1 4.166e-17 1.063 -1.1158 -1.0618
.25 .50 .75 .90 .95
-0.7647 -0.2244 0.5050 1.3424 1.8827
lowest : -1.142840 -1.115827 -1.088813 -1.061800 -1.034786
highest: 3.476478 3.638560 3.800641 4.692088 6.366929
--------------------------------------------------------------------------------
pct_functional
n missing distinct Info Mean Gmd .05 .10
761 0 619 1 4.388e-17 1.142 -1.4794 -1.2332
.25 .50 .75 .90 .95
-0.7384 -0.1182 0.7143 1.5062 1.7435
lowest : -2.155978 -2.082654 -2.027105 -1.919710 -1.872456
highest: 2.041621 2.055563 2.058539 2.079565 2.096853
--------------------------------------------------------------------------------
pct_non-functional
n missing distinct Info Mean Gmd .05 .10
761 0 618 1 1.072e-16 1.139 -1.62178 -1.36438
.25 .50 .75 .90 .95
-0.69793 -0.04573 0.69082 1.34989 1.75056
lowest : -1.767485 -1.747821 -1.723906 -1.720521 -1.704663
highest: 2.358458 2.378783 2.416136 2.483486 3.069827
--------------------------------------------------------------------------------
pct_hand
n missing distinct Info Mean Gmd .05 .10
761 0 619 1 1.95e-17 1.151 -1.5333 -1.4366
.25 .50 .75 .90 .95
-0.9577 0.0927 0.8977 1.3315 1.4369
lowest : -1.533321 -1.505447 -1.503854 -1.493139 -1.492068
highest: 1.529392 1.532834 1.543799 1.543951 1.560645
--------------------------------------------------------------------------------
pct_rural
n missing distinct Info Mean Gmd .05 .10
761 0 447 0.975 1.374e-16 1.038 -2.3488 -1.7713
.25 .50 .75 .90 .95
-0.4677 0.4200 0.8274 0.8274 0.8274
lowest : -2.3488119 -2.3224529 -2.3191273 -2.3160670 -2.3046972
highest: 0.8050784 0.8075325 0.8083123 0.8151828 0.8274464
--------------------------------------------------------------------------------
pct_acceptable
n missing distinct Info Mean Gmd .05 .10
761 0 607 1 1.841e-16 1.095 -2.0743 -1.3839
.25 .50 .75 .90 .95
-0.6504 0.2418 0.8449 1.0605 1.1491
lowest : -2.958071 -2.833631 -2.741002 -2.628126 -2.583133
highest: 1.149485 1.151308 1.151677 1.156726 1.166251
--------------------------------------------------------------------------------
pct_unacceptable
n missing distinct Info Mean Gmd .05
761 0 564 0.999 -2.492e-17 0.9772 -0.9163
.10 .25 .50 .75 .90 .95
-0.8793 -0.6735 -0.2966 0.3452 1.1947 2.0273
lowest : -0.9163197 -0.9027234 -0.8955168 -0.8949892 -0.8923879
highest: 4.4011564 4.4356855 4.4999096 4.7932548 4.9708860
--------------------------------------------------------------------------------
Notice the mean and standard deviation of the Z-score standardised clustering variables are 0 and 1 respectively.
Note: describe() of psych package is used here instead of summary() of Base R because the earlier provides standard deviation.
Warning: Z-score standardisation method should only be used if we would assume all variables come from some normal distribution.
3.2.3 Visualizing the Standardized Clustering Variables
Beside reviewing the summary statistics of the standardized clustering variables, it is also a good practice to visualize their distribution graphical.
The code chunk below plot the scaled pct_hand field.
r <- ggplot(data=nga_clust,
aes(x= `pct_acceptable`)) +
geom_histogram(bins=40,
color="black",
fill="light blue")
nga_clust.std_df <- as.data.frame(nga_clust.std)
s <- ggplot(data=nga_clust.std_df,
aes(x=`pct_acceptable`)) +
geom_histogram(bins=40,
color="black",
fill="light blue") +
ggtitle("Min-Max Standardisation")
nga_clust.z_df <- as.data.frame(nga_clust.z)
z <- ggplot(data=nga_clust.z_df,
aes(x=`pct_acceptable`)) +
geom_histogram(bins=40,
color="black",
fill="light blue") +
ggtitle("Z-score Standardisation")
ggarrange(r, s, z,
ncol = 1,
nrow = 3)
As the field pct_acceptable already has values bounded by 0 and 1, we see that histogram for min-max standardization is very similar to that of the histogram of the without scalling. We also noticed that the histogram for Z-score standardization does not differ significantly to that of the original percentage values. This is because the percentages have small ranges and thus standardization has very little impact on spreading of the data.
3.3 Computing Proximity Matrix
In R, many packages provide functions to calculate distance matrix. We will compute the proximity matrix by using dist() of R.
dist() supports six distance proximity calculations, they are: euclidean, maximum, manhattan, canberra, binary and minkowski. The default is euclidean proximity matrix.
The code chunk below is used to compute the proximity matrix using euclidean method.
proxmat <- dist(nga_clust, method = 'euclidean')3.4 Computing Hierarchical Clustering
3.4.1 Selecting the Optimal Clustering Algorithm
One of the challenge in performing hierarchical clustering is to identify stronger clustering structures. The issue can be solved by using use agnes() function of cluster package. It functions like hclus(), however, with the agnes() function you can also get the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).
The code chunk below will be used to compute the agglomerative coefficients of all hierarchical clustering algorithms.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(nga_clust, method = x)$ac
}
map_dbl(m, ac) average single complete ward
0.9882357 0.9693362 0.9934373 0.9980031
With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.
3.4.2 Determining Optimal Clusters
Another technical challenge face by data analyst in performing clustering analysis is to determine the optimal clusters to retain.
There are three commonly used methods to determine the optimal clusters, they are:
3.4.2.1 Gap Statistic Method
The gap statistic compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be value that maximize the gap statistic (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from the random uniform distribution of points.
To compute the gap statistic, clusGap() of cluster package will be used.
set.seed(1234)
gap_stat <- clusGap(nga_clust,
FUN = hcut,
nstart = 25,
K.max = 10,
B = 50)
print(gap_stat, method = "firstmax")Clustering Gap statistic ["clusGap"] from call:
clusGap(x = nga_clust, FUNcluster = hcut, K.max = 10, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstmax'): 1
logW E.logW gap SE.sim
[1,] 9.743872 10.915103 1.1712313 0.01421546
[2,] 9.491691 10.448103 0.9564123 0.03825944
[3,] 9.184797 10.235642 1.0508451 0.02453956
[4,] 9.088353 10.114359 1.0260062 0.01928817
[5,] 8.975447 10.018547 1.0430998 0.01690510
[6,] 8.911855 9.926594 1.0147393 0.01564758
[7,] 8.882740 9.839255 0.9565150 0.01647153
[8,] 8.806243 9.760503 0.9542597 0.01760538
[9,] 8.725137 9.692023 0.9668862 0.01666427
[10,] 8.668661 9.632051 0.9633895 0.01753653
Also note that the hcut function used is from factoextra package.
Next, we can visualize the plot by using fviz_gap_stat() of factoextra package.
fviz_gap_stat(gap_stat)
With reference to the gap statistic graph above, the recommended number of cluster to retain is 1. However, it is not logical to retain only one cluster. By examine the gap statistic graph, the 3-cluster gives the largest gap statistic and should be the next best cluster to pick.
fviz_nbclust(nga_clust, FUN = hcut, method = "silhouette")
3.4.3 Interpreting the Dendograms
In dendrograms, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are. Note that, conclusions about the proximity of two observations can be drawn only based on the height where branches containing those two observations first are fused. We cannot use the proximity of two observations along the horizontal axis as a criteria of their similarity.
It’s also possible to draw the dendrogram with a border around the selected clusters by using rect.hclust() of R stats. The argument border is used to specify the border colors for the rectangles.
hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.2)
rect.hclust(hclust_ward,
k = 3,
border = 2:5)
3.5 Visually-driven Hierarchical Clustering Analysis
In this section, we will perform visually-driven hiearchical clustering analysis by using heatmaply package.
With heatmaply, we are able to build both highly interactive cluster heatmap or static cluster heatmap.
3.5.1 Transforming the Data Frame into a Matrix
The data was loaded into a data frame, but it has to be a data matrix to make your heatmap.
The code chunk below will be used to transform nga_clust data frame into a data matrix.
nga_clust_mat <- data.matrix(nga_clust)3.5.2 Plotting Interactive Cluster Heatmap using heatmaply()
In the code chunk below, the heatmaply() of heatmaply package is used to build an interactive cluster heatmap.
heatmaply(normalize(nga_clust_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 3,
margins = c(NA,200,60,NA),
fontsize_row = 3,
fontsize_col = 5,
main="Geographic Segmentation of Nigeria LGA by water ponits indicators",
xlab = "Water Point Indicators",
ylab = "Nigeria LGA"
)Based on the heatmap above, we observe a distinct characteristic for areas that are grouped under the green branch is that they generally have a higher percentage of unacceptable quality of water. Those that are labeled under the pink branch have high percentage of acceptable water quality, moderate to high percentage of functional water points and low percentage of rural space.
3.5.3 Mapping the Clusters Formed
With closed examination of the dendragram above, we have decided to retain 3 clusters.
cutree() of R Base will be used in the code chunk below to derive a 3-cluster model.
groups <- as.factor(cutree(hclust_ward, k=3))The output is called groups. It is a list object.
In order to visualize the clusters, the groups object need to be appended onto nga simple feature object.
The code chunk below form the join in three steps:
the groups list object will be converted into a matrix;
cbind() is used to append groups matrix onto nga to produce an output simple feature object called
nga_sf_cluster; andrename of dplyr package is used to rename as.matrix.groups field as CLUSTER.
nga_filt <- nga %>%
filter(ADM2_EN %in% nga_wp$ADM2_EN)
nga_sf_cluster <- cbind(nga_filt, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)Next, qtm() of tmap package is used to plot the choropleth map showing the cluster formed.
qtm(nga_sf_cluster, "CLUSTER")
From the choropleth map, noticed that the clusters are dispersed all over Nigeria. This is perhaps due to the fact that no geospatial factors were considered when analyzing using hierarchical clustering algorithm.
Thus we will extend our analysis to include geospatial variables.
4 Spatially Constrained Clustering-skater Approach
In this section, we will derive spatially constrained cluster by using skater() method of spdep package.
4.1 Converting into SpatialPolygonsDataFrame
First, we need to convert nga_filt into SpatialPolygonsDataFrame. This is because SKATER function only support sp objects such as SpatialPolygonDataFrame.
The code chunk below uses as_Spatial() of sf package to convert shan_sf into a SpatialPolygonDataFrame called nga_sp.
nga_sp <- as_Spatial(nga_filt)4.2 Computing Neighbour List
Next, poly2nd() of spdep package will be used to compute the neighbours list from polygon list.
nga.nb <- poly2nb(nga_sp)
summary(nga.nb)Neighbour list object:
Number of regions: 761
Number of nonzero links: 4348
Percentage nonzero weights: 0.750793
Average number of links: 5.713535
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
4 16 57 122 177 137 121 71 39 11 4 1 1
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links
The above report indicates that there are 4 areas with only 1 neighbour and the average number of neigbours is approximately 6 (~5.713535). Area 496 has the most number of neighbours which is 14.
We can plot the neighbours list on nga_sp by using the code chunk below. Since we now can plot the community area boundaries as well, we plot this graph on top of the map. The first plot command gives the boundaries. This is followed by the plot of the neighbor list object, with coordinates applied to the original SpatialPolygonDataFrame (Nigeria LGA boundaries) to extract the centroids of the polygons. These are used as the nodes for the graph representation. We also set the color to blue and specify add=TRUE to plot the network on top of the boundaries.
plot(nga_sp,
border=grey(.5))
plot(nga.nb,
coordinates(nga_sp),
col="blue",
add=TRUE)
Note that if you plot the network first and then the boundaries, some of the areas will be clipped. This is because the plotting area is determined by the characteristics of the first plot. In this example, because the boundary map extends further than the graph, we plot it first.
4.3 Computing Minimum Spanning Tree
Conclusion
to try with other variables
to try other clustering alorithms
References
https://towardsdatascience.com/10-tips-for-choosing-the-optimal-number-of-clusters-277e93d72d92
https://uc-r.github.io/hc_clustering